Abstract

This paper considers the problem of clustering a partially observed unweighted graph - i.e. one 
where for some node pairs we know there is an edge between them, for some others we know there is no 
edge, and for the remaining we do not know whether or not there is an edge. We want to organize the 
nodes into disjoint clusters so that there is relatively dense (observed) connectivity within clusters, and 
sparse across clusters. 

We take a novel yet natural approach to this problem, by focusing on finding the clustering that 
minimizes the number of "disagreements" - i.e. the sum of the number of (observed) missing edges 
within clusters, and (observed) present edges across clusters. Our algorithm uses convex optimization; 
its basis is a reduction of disagreement minimization to the problem of recovering an (unknown) low-rank 
matrix and an (unknown) sparse matrix from their partially observed sum. We evaluate our algorithm 
on the natural and classical Planted Partition/Stochastic Block Model of clustered random graphs; our 
main result characterizes when our algorithm succeeds as a function of minimum cluster size, edge noise 
and observation probability. When there are a constant number of clusters of equal size, our results are 
optimal up to logarithmic factors. 



1 Introduction 



This paper is about the following task: given partial observation of an undirected unweighted graph, 
partition the nodes into disjoint clusters so that there are dense connections within clusters, and sparse 
connections across clusters. By partial observation, we mean that for some node pairs we know if there is 
an edge or not, and for the other node pairs we do not know - these pairs are unobserved. This problem 
arises in several fields across science and engineering. For example, in sponsored search, each cluster is 
a submarket that represents a specific group of advertisers that do most of their spending on a group of 
query phrases - see e.g. Yahoo!-Inc (2009) for such a project at Yahoo. In VLSI and design automation,
it is useful in minimizing signaling between components, layout etc. - see e.g. Kernighan and Lin (1970)
and references therein. In social networks, clusters may represent groups of people with similar interests
or backgrounds; finding clusters enables better recommendations, link prediction, etc. (Mishra et al., 2007).
In the analysis of document databases, clustering the citation graph is often an essential and informative
first step (Ester et al., 1995). In this paper, we will focus not on specific application domains, but rather
on the basic graph clustering problem itself.

Partially observed graphs appear in many applications. For example, in online social networks like
Facebook, we observe an edge/no edge between two users when they accept each other as a friend or
explicitly decline a friendship suggestion. For the remaining user pairs, however, we simply have no
friendship information; these pairs are thus unobserved. More generally, we have partial observations
whenever obtaining similarity data is difficult or expensive (e.g. because it requires human participation).
In these applications, it is often the case that most pairs are unobserved, which is the regime we are
particularly interested in.

As with any clustering problem, this needs a precise mathematical definition of the clustering criterion,
ideally one with guaranteed performance. There is relatively little existing work with provable performance
guarantees for partially observed graphs. Even most existing approaches to clustering fully observed graphs
either require an additional input (e.g. the number of clusters k required for spectral or k-means clustering
approaches), or do not guarantee the performance of the clustering. We review these results in Section 1.1
below, and show that our results extend the known guarantees there to partially observed graphs.

Our Formulation: We focus on a natural formulation, one that does not require any other extraneous 
input besides the graph itself. It is based on minimizing disagreements, which we now define. Consider 
any candidate clustering; this will have (a) observed node pairs that are in different clusters, but have an 
edge between them, and (b) observed node pairs that are in the same cluster, but do not have an edge 
between them. The total number of node pairs of types (a) and (b) is the number of disagreements between 
the clustering and the given graph. We focus on the problem of finding the optimal clustering - one that 
minimizes the number of disagreements. Note that we do not pre-specify the number of clusters. For the 
special case of fully observed graphs, this formulation is exactly the same as the problem of "Correlation
Clustering", first proposed by Bansal et al. (2002). They showed that exact minimization of the above
objective is NP-complete in the worst case - we survey and compare this and other related work in Section 1.1.
As we will see, our approach and results are very different.
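
As a concrete illustration of the disagreement count defined above, the following minimal Python sketch evaluates it for a candidate clustering. The encoding is our assumption for this example only: `adj` stores 1 for an observed edge, 0 for an observed non-edge and -1 for an unobserved pair, and `labels[i]` is the cluster index of node i. The function name `count_disagreements` is hypothetical.

```python
import numpy as np

def count_disagreements(adj, labels):
    """Count observed node pairs that disagree with the clustering `labels`."""
    adj = np.asarray(adj)
    labels = np.asarray(labels)
    n = adj.shape[0]
    same = labels[:, None] == labels[None, :]            # True if i and j share a cluster
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)    # visit each pair {i, j} once
    observed = (adj >= 0) & upper
    present_across = observed & ~same & (adj == 1)       # type (a)
    missing_inside = observed & same & (adj == 0)        # type (b)
    return int(present_across.sum() + missing_inside.sum())

# Toy graph on 4 nodes; -1 marks an unobserved pair.
adj = np.array([[ 0,  1, -1,  0],
                [ 1,  0,  1, -1],
                [-1,  1,  0,  0],
                [ 0, -1,  0,  0]])
print(count_disagreements(adj, [0, 0, 0, 1]))   # clusters {0, 1, 2} and {3}
print(count_disagreements(adj, [0, 0, 1, 1]))   # clusters {0, 1} and {2, 3}
```

On this toy input the first clustering has no disagreements, while the second has two: the observed edge (1, 2) crosses clusters, and the observed non-edge (2, 3) lies inside a cluster. Unobserved pairs never contribute.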

Our Approach and Results: We aim to achieve the combinatorial disagreement minimization objective
using matrix splitting via convex optimization. In particular, as we show in Section 2 below, one can
represent the adjacency matrix of the given graph as the sum of an unknown low-rank matrix (corresponding
to "ideal" clusters) and a sparse matrix (corresponding to disagreements from this "ideal" in the given
graph). Our algorithm either returns a clustering, which is guaranteed to be disagreement minimizing,
or returns a "failure" - it never returns a sub-optimal clustering. For our main result, we evaluate our
algorithm's performance on the natural and classical Planted Partition/Stochastic Block Model of clustered
random graphs with partial observations; the model and results are outlined in Section 2. We prove our
theoretical results in Section 3 and provide empirical results in Section 4. Our analysis provides stronger
guarantees than current results on general matrix splitting (Chandrasekaran et al., 2011; Candes et al.,
2009; Hsu et al., 2010; Li, 2012).



1.1 Related Work 

Our problem can be interpreted in the general clustering context as one in which the presence of an edge 
between two points indicates a "similarity", and the lack of an edge means no similarity. The general 
field of clustering is of course vast, and a detailed survey of all methods therein is beyond our scope here. 
We focus instead on the two sets of papers most relevant to the problem here: the work on Correlation
Clustering, and that on the Planted Partition/Stochastic Block Model.



Correlation Clustering: As mentioned, for a completely observed graph, our problem is mathematically
precisely the same as Correlation Clustering formulated in Bansal et al. (2002); in particular a "+" in
correlation clustering corresponds to an edge in the graph, and a "-" to the lack of an edge. Disagreements
are defined in the same way. Thus, this paper can equivalently be considered as an algorithm, and
guarantees, for correlation clustering under partial observations. Since correlation clustering is NP-complete
in the worst case (Bansal et al., 2002), there has been much work on devising alternative approximation
algorithms (Bansal et al., 2002; Becker, 2005; Emmanuel and Fiat, 2003). Approximations using convex
optimization, including LP relaxation (Charikar et al., 2003; Demaine and Immorlica, 2003; Demaine et al.,
2005) and SDP relaxation (Swamy, 2004; Mathieu and Schudy, 2010), followed by rounding, have also been
developed. We emphasize that we use a different convex relaxation, and we do not do rounding; rather, our
convex program itself yields an optimal clustering.



We note that the result in Mathieu and Schudy (2010) uses a convex formulation with constraints
enforcing positive semi-definiteness, triangle inequality and fixed diagonal entries. For the fully observed
case, their relaxation is at least as tight as ours, and since they add more constraints, it is possible that
there are instances where their convex program works and ours does not. However, this seems hard to
prove/disprove. Indeed, in the restricted setting they considered, their theoretical guarantee is identical to
ours. Moreover, as we argue in the next section, our guarantees are order-wise optimal in some important
cases and thus cannot be improved even with a tighter relaxation. Practically, our method is faster since,
to the best of our knowledge, there is no low-complexity algorithm to deal with the positive semi-definite
and linear constraints required by Mathieu and Schudy (2010). This means that our method can handle
very large graphs while their result is practically restricted to small ones (~100 nodes). In summary, their
approach has much higher computational complexity, and does not provide significant and characterizable
gain in performance.



Planted Partition Model: The planted partition model, a.k.a. stochastic block model (Condon
and Karp, 2001; Holland et al., 1983), assumes that the graph is generated with in-cluster edge probability
p and inter-cluster edge probability q (p > q) and fully observed. The goal is to recover the latent cluster
structure. A class of this model with $\tau = \max\{1 - p, q\} < 1/2$ is often used as benchmark for average
case performance for correlation clustering (see, e.g., Mathieu and Schudy (2010)). Our theoretical results
are applicable to this model and thus directly comparable with existing work in this area. A detailed
comparison is provided in Table 1. For fully observed graphs, our result matches the previous best bounds
in both the minimum cluster size and the difference between in-cluster/inter-cluster densities. We would
like to point out that nuclear norm minimization has been used to solve the closely related planted clique
problem (Alon et al., 1998; Ames and Vavasis, 2011).



Partially Observed Graphs: None of the previous work listed in Table 1, except Oymak and Hassibi
(2011), handles partial observations directly. One natural way to proceed is to impute the missing
observations with no-edge, or random edges with symmetric probabilities, and then apply any of the results
in Table 1. This approach, however, leads to sub-optimal results. Indeed, this is done explicitly in Oymak
and Hassibi (2011), and they require the probability of observation $p_0$ to satisfy
$p_0 \gtrsim \sqrt{n}/K_{\min}$, where n is the number of nodes and $K_{\min}$ is the minimum cluster size.
In contrast, our approach only needs $p_0 \gtrsim n/K_{\min}^2$, which is order-wise better. Shamir and
Tishby (2011) deal with partial observations directly and show that $p_0 \gtrsim 1/n$ suffices for recovering
two clusters of size $\Omega(n)$. Our result applies to much smaller clusters of size
$\tilde{\Omega}(\sqrt{n})$. In addition, a nice feature of our result is that it gives an explicit tradeoff
between the three relevant parameters: $p_0$, $\tau$, and $K_{\min}$; no theoretical result of this kind is
available in previous work.

We note that there exist other works that consider partial observations, but under rather different
settings. For example, Balcan and Gupta (2010), Voevodski et al. (2010) and Krishnamurthy et al. (2012)
consider the problem where one samples the rows/columns of the adjacency matrix A rather than its
entries. Hunter and Strohmer (2010) consider partial observations in the features, rather than in the
similarity graph. Eriksson et al. (2011) show that O(n) actively selected pairwise similarities are sufficient
for recovering the hierarchical clustering structure. Their results seem to rely heavily on the hierarchical
structure. When disagreements are present, the first split of the tree can only recover clusters of size
$\Omega(n)$.
Moreover, they require control over the observation process, while we assume random observations.

Table 1: Comparison with literature. This table shows the lower-bound requirements on the minimum
cluster size K and the density difference $p - q = 1 - 2\tau$ that existing literature needs for exact recovery
of the planted partitions, when the graph is fully observed and $\tau = \max\{1 - p, q\} = \Theta(1)$. Some of the
results in the table only guarantee recovering the membership of most, instead of all, nodes. To make a
fair comparison, we use the soft-O notation $\tilde{O}(\cdot)$. It hides the logarithmic factors that are necessary for
recovering all nodes, which is the goal of this paper. (*) The result of Sussman et al. (2012) is stated in
terms of parameters related to the factorization of the density matrix, which does not translate directly
into the density difference of the standard planted partition model.

2 Main Contributions



2.1 Algorithm 



Our algorithm is based on convex optimization, and either (a) outputs a clustering that is guaranteed to 
be the one that minimizes the number of observed disagreements, or (b) declares "failure". In particular, it 
never produces a suboptimal clustering.¹ We now briefly present the main idea, then describe the algorithm,
and finally present our main results - analytical characterizations of when the algorithm succeeds. 

Setup: We are given a partially observed graph, whose adjacency matrix is A - which has $a_{ij} = 1$ if
there is an edge between nodes i and j, $a_{ij} = 0$ if there is no edge, and $a_{ij} = ?$ if we do not know. (Here
we follow the convention that $a_{ii} = 0$ for all i.) We want to find the optimal clustering, i.e. the one that
has the minimum number of disagreements in $\Omega_{obs}$, the set of observed node pairs.

Idea: Consider first the fully observed case, i.e. every $a_{ij} = 0$ or 1. Suppose also that the graph
is already ideally clustered - i.e. there is a partition of the nodes such that there are no edges between
clusters, and each cluster is a clique. In this case, the matrix A + I is now a low-rank matrix, with the
rank equal to the number of clusters. This can be seen by noticing that if we re-ordered the rows and
columns so that clusters appear together, the result would be a block-diagonal matrix, with each block
being an all-ones sub-matrix - and thus rank one. Of course, this re-ordering does not change the rank of
the matrix, and hence A + I is (exactly) low-rank.
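
The low-rank claim is easy to check numerically. The sketch below, using NumPy and a small hand-picked labelling (both are illustrative assumptions), builds A + I for an ideally clustered graph and compares its rank with the number of clusters, before and after relabelling the nodes.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1, 2])                     # three ideal clusters
n = labels.size
K = (labels[:, None] == labels[None, :]).astype(float)    # block-diagonal all-ones
A = K - np.eye(n)                                         # adjacency: disjoint cliques

perm = np.random.default_rng(0).permutation(n)
K_shuffled = K[np.ix_(perm, perm)]                        # same graph, nodes relabelled

rank_ordered = np.linalg.matrix_rank(A + np.eye(n))
rank_shuffled = np.linalg.matrix_rank(K_shuffled)
print(rank_ordered, rank_shuffled, len(np.unique(labels)))
```

Both ranks equal the number of clusters, since simultaneously permuting rows and columns leaves the rank unchanged.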

Consider now any given graph, still fully observed. In light of the above, we are looking for a
decomposition of its A + I into a low-rank part K* (of block-diagonal all-ones, one block for each cluster) and a
remaining B* (the disagreements) - such that the number of non-zero entries in B* is as small as possible;
i.e. B* is sparse. Finally, the problem we look at is recovery of the best K* when we do not observe all
entries. The idea is depicted in Figure 1.

Convex Optimization Formulation: We propose to do the matrix splitting using convex optimization,
an approach recently taken in Chandrasekaran et al. (2011) and Candes et al. (2009); however, we
establish stronger results for our special problem. Our approach consists of dropping any additional
structural requirements, and just looking for a decomposition of the given A + I as the sum of a sparse matrix
B and a low-rank matrix K. In particular, let $\Omega_{obs}$ be the set of observed entries, i.e. the set of elements
of A that are known to be 0 or 1; we use the following convex program

$$\min_{B, K} \ \lambda \|B\|_1 + \|K\|_* \quad \text{s.t.} \quad \mathcal{P}_{\Omega_{obs}}(B + K) = \mathcal{P}_{\Omega_{obs}}(I + A). \tag{1}$$

Here, for any matrix M, the term $\mathcal{P}_{\Omega_{obs}}(M)$ keeps all elements of M in $\Omega_{obs}$ unchanged, and sets all other
elements to 0; the constraints thus state that the sparse and low-rank matrix should in sum be consistent
with the observed entries. $\|B\|_1 = \sum_{i,j} |b_{ij}|$ is the $\ell_1$ norm of the entries of the matrix, which is well-known
to be a convex surrogate for the number of non-zero entries $\|B\|_0$. The second term $\|K\|_* = \sum_s \sigma_s(K)$
is the nuclear norm: the sum of singular values of K. This has been shown to be the tightest convex
surrogate² for the rank function (Fazel, 2002). Thus our objective function is a convex surrogate for
the (natural) combinatorial objective $\lambda \|B\|_0 + \mathrm{rank}(K)$. The optimization problem (1) is, in fact, a
semi-definite program (SDP) (Chandrasekaran et al., 2011).

1 In practice, one might be able to use the "failed" output with rounding as an approximate solution. In this paper, we
focus on the performance of the exact unrounded algorithm.

2 In particular, it is the $\ell_1$ norm of the singular value vector, while rank is the $\ell_0$ norm of the same.

Figure 1: The adjacency matrix of a graph drawn from the planted partition model before and after
proper reordering (i.e. clustering) of the nodes. The figure on the right is indicative of the matrix as a
superposition of a sparse matrix and a low-rank one.



We remark on the above formulation. (a) This formulation does not require specifying the number of
clusters; this parameter is effectively learned from the data. The tradeoff parameter $\lambda$ is artificial and can
be easily determined: since K* has trace exactly equal to n, we simply choose the smallest $\lambda$ such that the
trace of the optimal solution is at least n. This can be done by, e.g., bisection, which is described below.
(b) It is possible to obtain tighter convex relaxations by adding more constraints, such as the diagonal
entries $k_{ii} = 1$, K being positive semi-definite, or even the triangle inequalities $k_{ij} + k_{jk} - k_{ik} \le 1$.
Indeed, this is done by Mathieu and Schudy (2010). We choose not to take this approach for the following
reasons. First, and most importantly, even with the extra constraints, Mathieu and Schudy (2010) do not
deliver better guarantees (cf. Table 1). In fact, we show in Section 2.3 that our results are near optimal in
some important cases, so tighter relaxations do not seem to provide additional benefits; there we elaborate
on this point. Second, our formulation can be solved efficiently using Augmented Lagrangian Multiplier
methods (Lin et al., 2009), which is due to the fact that the proximal function of the nuclear norm can be
easily computed. This is no longer the case with the extra semi-definite and linear constraints, and solving
it as a standard SDP is only applicable to small graphs.
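
To illustrate why the proximal step is cheap, the sketch below implements the two proximal maps (entrywise soft thresholding and singular value thresholding) and wraps them in a plain fixed-penalty ADMM loop for program (1). This is a simplified illustration under our own assumptions, not the implementation of Lin et al. (2009): the function names, the penalty `mu`, and the iteration count are hypothetical choices, and B is left unpenalized on unobserved entries so that the constraint can be written as B + K = M.

```python
import numpy as np

def soft_threshold(X, t):
    """Proximal map of t * (entrywise l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def singular_value_threshold(X, t):
    """Proximal map of t * (nuclear norm): shrink every singular value by t."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - t, 0.0)) @ Vt

def split_admm(M, observed, lam, mu=1.0, n_iter=200):
    """Approximately solve min lam*||B||_1 + ||K||_* s.t. P_obs(B + K) = P_obs(M)."""
    M = np.where(observed, M, 0.0)            # unobserved entries carry no information
    B = np.zeros_like(M)
    Y = np.zeros_like(M)                      # Lagrange multiplier for B + K = M
    for _ in range(n_iter):
        K = singular_value_threshold(M - B + Y / mu, 1.0 / mu)
        R = M - K + Y / mu
        # Observed entries are penalized by lam; unobserved ones are free.
        B = np.where(observed, soft_threshold(R, lam / mu), R)
        Y = Y + mu * (M - K - B)
    return np.where(observed, B, 0.0), K

# Noise-free, fully observed toy input: two cliques of size 3, M = A + I.
labels = np.array([0, 0, 0, 1, 1, 1])
M = (labels[:, None] == labels[None, :]).astype(float)
B_hat, K_hat = split_admm(M, np.ones_like(M, dtype=bool), lam=1.0)
print(np.round(K_hat, 3))
```

On this toy input the iterates settle at K = A + I and B = 0 after two iterations. The only expensive step per iteration is one SVD, which is what makes this family of methods attractive compared with a generic SDP solver.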

Definition: Validity: The convex program (1) is said to produce a valid output if the low-rank
matrix part K of the optimum corresponds to a graph of disjoint cliques; i.e. its rows and columns can be
re-ordered to yield a block-diagonal matrix with all-one matrices for each block.

Validity of a given K can easily be checked via elementary re-ordering operations.³ Our first simple,
but crucial, insight is that whenever the convex program yields a valid solution, it is the disagreement
minimizer.

3 If we re-order a valid K such that identical rows and columns appear together, it will become block-diagonal.

Theorem 1. For any $\lambda > 0$, if the optimum of (1) is valid, then it is the clustering that minimizes the
number of observed disagreements.
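
The validity check can be sketched in a few lines. The version below groups identical rows instead of physically re-ordering them, which is equivalent; the tolerance `tol` and the function name are our assumptions, introduced to cope with the small numerical errors of an iterative solver.

```python
import numpy as np

def clustering_from_valid(K, tol=1e-6):
    """Return cluster labels if K encodes disjoint cliques, otherwise None."""
    K = np.asarray(K, dtype=float)
    Z = np.rint(K).astype(int)
    if np.abs(K - Z).max() > tol or not np.isin(Z, (0, 1)).all():
        return None                                   # entries must be (numerically) 0 or 1
    # Nodes with identical rows form one candidate cluster.
    _, labels = np.unique(Z, axis=0, return_inverse=True)
    labels = labels.ravel()
    same = (labels[:, None] == labels[None, :]).astype(int)
    # K is valid exactly when it equals the co-membership matrix of these groups.
    return labels if np.array_equal(Z, same) else None

valid_K = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 1]])
invalid_K = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]])
labels_valid = clustering_from_valid(valid_K)
labels_invalid = clustering_from_valid(invalid_K)
print(labels_valid, labels_invalid)
```

The second matrix is rejected because node 1 is linked to both other nodes while they are not linked to each other, so no re-ordering makes it block-diagonal with all-one blocks.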



Algorithm: Our method is given as Algorithm 1. It takes the adjacency matrix of the network A
and outputs either the optimal clustering or declares failure.

Algorithm 1 Optimal-Cluster(A)

  $\lambda \leftarrow \frac{1}{32\sqrt{p_0 n}}$
  while not terminated do
    Solve (1)
    if solution K is valid then
      Output the clustering w.r.t. K and EXIT.
    else if trace(K) > n then
      $\lambda \leftarrow \lambda/2$
    else if trace(K) < n then
      $\lambda \leftarrow 2\lambda$
    end if
  end while
  Declare Failure.



To solve the optimization problem (1), we recommend using the fast implementation developed in (Lin
et al., 2009), which is tailored for matrix splitting and takes advantage of the sparsity of the observations.
Setting the parameter $\lambda$ is done via binary search, since any valid clustering matrix K is positive semidefinite
and has trace(K) = $\|K\|_* = n$. The initial value of $\lambda$ is not crucial; we use $\lambda = \frac{1}{32\sqrt{p_0 n}}$ based on our
theoretical analysis in the next sub-section, where $p_0$ is the empirical fraction of observed pairs. Whenever
the algorithm results in a valid K, we have found the optimal clustering (by Theorem 1).
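
The outer loop of Algorithm 1 can be sketched as follows. The solver and the validity test are passed in as callables, since any routine for program (1) can be plugged in; their signatures, the cap `max_rounds`, and the handling of the tie trace(K) = n are our assumptions. The stand-in solver `toy_solve` exists only to make the example runnable and does not solve (1).

```python
import numpy as np

def optimal_cluster(M, observed, solve, is_valid, max_rounds=30):
    """Sketch of Algorithm 1 with M = A + I.

    `solve(M, observed, lam)` is assumed to return (B, K) for program (1);
    `is_valid(K)` is assumed to test whether K encodes disjoint cliques.
    """
    n = M.shape[0]
    p0 = observed.mean()                      # empirical fraction of observed entries
    lam = 1.0 / (32.0 * np.sqrt(p0 * n))      # initial value suggested by the analysis
    for _ in range(max_rounds):
        _, K = solve(M, observed, lam)
        if is_valid(K):
            return K, lam
        if np.trace(K) > n:
            lam /= 2.0
        else:                                 # trace(K) <= n; treating the tie this way is our choice
            lam *= 2.0
    return None, lam                          # declare failure

# Stand-in solver for demonstration only: all-ones K once lam is large enough.
def toy_solve(M, observed, lam):
    n = M.shape[0]
    K = np.ones((n, n)) if lam >= 0.1 else np.zeros((n, n))
    return np.zeros((n, n)), K

M = np.ones((4, 4))                           # a single clique on 4 nodes, so A + I is all ones
K, lam = optimal_cluster(M, np.ones((4, 4), dtype=bool), toy_solve,
                         is_valid=lambda K: bool(np.all(K == 1)))
print(lam)
```

With this stand-in, $\lambda$ starts at 1/64 and is doubled three times before the returned K is accepted.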



2.2 Performance Analysis 

For the main analytical contribution of this paper, we provide conditions under which the above algorithm 
will find the clustering that minimizes the number of disagreements among the observed entries. In partic- 
ular, we characterize its performance under the standard and classical planted partition/stochastic block 
model with partial observations, which we now describe. 

Planted Partition Model with Partial Observations: Suppose n nodes are partitioned into r
clusters, each of size at least $K_{\min}$. Let K* be the low-rank matrix corresponding to this clustering (as
described above). The adjacency matrix A of the graph is generated as follows: for each pair of nodes (i, j)
in the same cluster, $a_{ij} = ?$ with probability $1 - p_0$, $a_{ij} = 1$ with probability $p_0 p$, or $a_{ij} = 0$ otherwise,
independent of all others; similarly, for (i, j) in different clusters, $a_{ij} = ?$ with probability $1 - p_0$, $a_{ij} = 1$
with probability $p_0 q$, or $a_{ij} = 0$ otherwise.
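
A sampler for this model is short to write. In the sketch below, unobserved pairs are represented by a separate boolean mask rather than by the symbol "?", and the diagonal is marked as observed; both are representation choices of ours, and the function name is hypothetical.

```python
import numpy as np

def sample_planted_partition(labels, p, q, p0, seed=0):
    """Return (A, observed): A holds 0/1 entries, `observed` marks the known pairs."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = labels.size
    same = labels[:, None] == labels[None, :]
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)      # one draw per pair {i, j}
    edges = (rng.random((n, n)) < np.where(same, p, q)) & upper
    seen = (rng.random((n, n)) < p0) & upper
    A = (edges & seen).astype(int)
    A = A + A.T                                            # symmetric, zero diagonal
    observed = seen | seen.T | np.eye(n, dtype=bool)
    return A, observed

labels = np.repeat([0, 1], 100)                            # two clusters of size 100
A, observed = sample_planted_partition(labels, p=0.8, q=0.1, p0=0.5)
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(labels.size, dtype=bool)
print(observed[off_diag].mean())                           # close to p0
print(A[observed & same & off_diag].mean())                # close to p
print(A[observed & ~same].mean())                          # close to q
```

Entries of A outside `observed` are stored as 0 but carry no information; any consumer should consult the mask first.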

Under the above model, the graph is observed at locations chosen at random with probability $p_0$,
and in expectation a fraction of 1 - p (respectively, q) of these observations are in-cluster (respectively,
across-cluster) disagreements. Let $B^* = \mathcal{P}_{\Omega_{obs}}(A + I - K^*)$ be the matrix of observed disagreements
for the original clustering. Note that the support of B* is contained in $\Omega_{obs}$. The following theorem says
when our algorithm recovers the original clustering (K*, B*) with high probability. Combined with Theorem 1, it
also shows that with high probability the original clustering is disagreement minimizing.



Theorem 2. Let $\tau = \max\{1 - p, q\}$. For any constant c > 0, there exists a constant C such that, with
probability at least $1 - cn^{-10}$, the original clustering (K*, B*) is the unique optimal solution of (1) with
$\lambda = \frac{1}{32\sqrt{n p_0}}$, provided that

$$p_0 (1 - 2\tau)^2 \ge C \frac{n \log^2 n}{K_{\min}^2}.$$



The theorem gives the condition that needs to be satisfied for recovery in terms of the three parameters
that define the problem: the minimum cluster size $K_{\min}$, the density gap $p - q \ge 1 - 2\tau$, and the observation
probability $p_0$. We remark on these parameters.

Minimum cluster size $K_{\min}$: Since the left hand side of the condition in Theorem 2 is no more
than 1, it imposes a lower-bound $K_{\min} = \tilde{\Omega}(\sqrt{n})$ on the cluster sizes. This means that our method can
handle a growing number ($\tilde{O}(\sqrt{n})$) of clusters. The lower-bound is attained when $1 - 2\tau$ and $p_0$ are both
$\Theta(1)$, i.e., not decreasing as n grows. Note that all relevant works require a lower-bound at least as strong
as ours (cf. Table 1).

Density gap $1 - 2\tau$: When $p_0 = \Theta(1)$, our result allows this gap to be vanishingly small, i.e.,
$\tilde{\Omega}\left(\frac{\sqrt{n}}{K_{\min}}\right)$, with larger $K_{\min}$ allowing a smaller gap. As we mentioned before, this matches the best available
results (cf. Table 1), including those in (Mathieu and Schudy, 2010) and (Oymak and Hassibi, 2011), which
use tighter convex relaxations that are more computationally demanding. We note that applying existing
results in the low-rank-plus-sparse literature (Candes et al., 2009; Li, 2012) leads to weaker results, where
the gap is required to be $\Theta(1)$.



Observation probability $p_0$: When $1 - 2\tau = \Theta(1)$, our result only requires a vanishing fraction
of observations, i.e., $p_0$ can be as small as $\tilde{\Theta}\left(\frac{n}{K_{\min}^2}\right)$; larger $K_{\min}$ allows smaller $p_0$. As mentioned in the
related work section, this scaling is better than existing results we know of. Moreover, our result provides
an explicit trade-off between the observation probability $p_0$ and the density gap $1 - 2\tau$, which is new. In
particular, $p_0$ can go to zero quadratically faster than $1 - 2\tau$. Consequently, treating missing observations
as disagreements would lead to quadratically weaker results. This agrees with the intuition that handling
missing entries with known locations is easier than correcting disagreements whose locations are unknown.
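
The quadratic trade-off can be spelled out in a line of arithmetic. Below, $\rho$ is a shorthand introduced here only for illustration: it stands for the right-hand side of the condition in Theorem 2, which we read as $p_0(1-2\tau)^2 \ge \rho$. The last line is a heuristic comparison under the assumption that a fully observed guarantee constrains the squared density gap in the same way; it is not a statement taken from the cited works.

```latex
% rho: assumed shorthand for the right-hand side of the condition in Theorem 2.
\begin{align*}
  \text{constant gap, } 1-2\tau = \Theta(1): \quad & p_0 \gtrsim \rho, \\
  \text{constant observation rate, } p_0 = \Theta(1): \quad & 1-2\tau \gtrsim \sqrt{\rho}, \\
  \text{unobserved pairs imputed as non-edges:} \quad
      & \text{density gap } p_0 p - p_0 q = p_0 (p-q), \\
      & \bigl(p_0 (p-q)\bigr)^2 \gtrsim \rho
        \ \Longrightarrow\ p_0 \gtrsim \frac{\sqrt{\rho}}{p-q}.
\end{align*}
```

For a constant gap, the imputation route therefore needs $p_0$ of order $\sqrt{\rho}$ instead of $\rho$, which is the quadratic loss referred to above.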



Finally, we would like to point out that our algorithm has the capability to handle outliers. Suppose
there are some isolated nodes which do not belong to any cluster, and they connect to each node in the
clusters and to each other with probability at most $\tau$, with $\tau$ obeying the condition in Theorem 2. Our
algorithm will classify all these edges as disagreements - and hence automatically reveal the identity of
each outlier. In the output of our algorithm, the low rank part K will have all zeros in the columns and
rows corresponding to outliers - all their edges will appear in the disagreement matrix B.

2.3 Lower Bounds



In this subsection, we discuss the tightness of Theorem 2. Consider first the case where $K_{\min} = \Theta(n)$, i.e.,
there are a constant number of clusters. We establish fundamental lower bounds on the density gap $1 - 2\tau$
and the observation probability $p_0$ that are required for any algorithm to correctly recover the clusters.

Theorem 3. Suppose all clusters have equal size $K = \Theta(n)$, and $\tau = \Theta(1)$. Under the planted partition
model with partial observations, for any algorithm to correctly identify the clusters with probability at least
$\frac{3}{4}$, we need

$$p_0 (1 - 2\tau)^2 \ge C \frac{1}{n},$$

where C > 0 is an absolute constant.



This theorem applies to any algorithm regardless of its computational complexity. It shows that, when
$K_{\min} = \Theta(n)$, the requirement for $1 - 2\tau$ and $p_0$ in Theorem 2 is optimal up to logarithmic factors, and in
particular cannot be significantly improved by using more complicated methods. Moreover, to the best of
our knowledge this is the first converse result that characterizes the fundamental tradeoff between $p_0$ and
$1 - 2\tau$.



For the general case $K_{\min} = O(n)$, only part of the picture is known. Using non-rigorous arguments,
Decelle et al. (2011) show that $1 - 2\tau \gtrsim \frac{\sqrt{n}}{K_{\min}}$ is necessary when $\tau = \Theta(1)$ and the graph is fully observed;
otherwise recovery is impossible or exponentially hard. According to this lower-bound, our requirement
on the density gap $1 - 2\tau$ is probably tight (up to log factors) for all $K_{\min}$. However, a rigorous proof of
this claim is still lacking, and seems to be a difficult problem. Similarly, the tightness of our condition on
$p_0$ and the tradeoff between $p_0$ and $\tau$ is also unclear in this regime.



3 Proofs 

3.1 Proof of Theorem 1

In this section, we prove Theorem 1; in particular, that if the optimization problem (1) produces a valid
low-rank matrix, i.e. one that corresponds to a clustering of the nodes, then this is the disagreement
minimizing clustering. Consider the following non-convex optimization problem

$$\min_{B, K} \ \lambda \|B\|_1 + \|K\|_* \quad \text{s.t.} \quad \mathcal{P}_{\Omega_{obs}}(B + K) = \mathcal{P}_{\Omega_{obs}}(I + A), \quad K \text{ is valid}, \tag{2}$$

and let (B, K) be any feasible solution. Since K represents a valid clustering, it is positive semidefinite
and has all ones along its diagonal. Therefore, it obeys $\|K\|_* = \mathrm{trace}(K) = n$. On the other hand,
because both K - I and A are adjacency matrices, the entries of B = I + A - K in $\Omega_{obs}$ must be equal
to -1, 1 or 0 (i.e. it is a disagreement matrix). Clearly any optimal B must have zero at the entries in
$\Omega_{obs}^c$. Hence $\|B\|_1 = \|B\|_0$ when K is valid. Since the nuclear-norm term equals n for every feasible K,
we conclude that the above optimization problem is equivalent to minimizing $\|B\|_0$ subject to the
constraints in (2). This is exactly the minimization of the
number of disagreements on the observed edges. Now notice that (1) is a relaxed version of (2). Therefore,
if the optimum of (1) is valid and thus feasible to (2), then it is also optimal to (2), the disagreement
minimization problem.
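
The two identities used in this proof can be checked numerically on a small example. The graph below is a hand-made, fully observed instance chosen for illustration; it is not taken from our experiments.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 1])
n = labels.size
K = (labels[:, None] == labels[None, :]).astype(float)    # a valid K

# Fully observed graph with two disagreements relative to K.
A = K - np.eye(n)
A[0, 1] = A[1, 0] = 0.0        # missing edge inside the first cluster
A[1, 2] = A[2, 1] = 1.0        # extra edge across clusters

B = np.eye(n) + A - K          # forced by the constraint when every pair is observed
nuclear_norm = np.linalg.svd(K, compute_uv=False).sum()
l1_norm = np.abs(B).sum()
l0_norm = np.count_nonzero(B)
print(nuclear_norm, np.trace(K))    # both equal n
print(l1_norm, l0_norm)             # equal, since entries of B lie in {-1, 0, 1}
```

Because B is symmetric, each disagreeing pair appears twice, so both norms equal twice the number of disagreements.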



3.2 Proof of Theorem 2

3.2.1 Proof Outline and Preliminaries

We now overview the main steps in the proof of Theorem 2; the following sub-sections provide details.
Recall that we would like to show that K* and B* corresponding to the true clustering are the unique
optimum of our convex program (1). This involves the following steps:

Step 1: Show that it suffices to consider an equivalent model for the observation and disagreements. 
This model is easier to handle, especially when the observation probability and density gap are vanishing, 
which is the case considered in this paper. 

Step 2: Write down sub-gradient based first-order sufficient conditions that need to be satisfied for
K*, B* to be the unique optimum of (1). In our case, this involves showing the existence of a matrix W
- the dual certificate - that satisfies certain properties. This step is technically involved - requiring us to
delve into the intricacies of sub-gradients since our convex function is not smooth - but otherwise standard.
Luckily for us, this has been done by (Chandrasekaran et al., 2011; Candes et al., 2009; Li, 2012).



Step 3: Using the assumptions made on the true clustering and its disagreements (K*, B*), construct 
a candidate dual certificate W that meets the requirements - and thus certifies K* , B* as being the unique 
optimum. 



The crucial third step is where we go beyond the existing literature on matrix splitting
(Chandrasekaran et al., 2011; Candes et al., 2009; Li, 2012). These results assume the observation probability
and/or density gap is at least a constant, and hence do not apply to our setting. Here we provide a refined
analysis, which leads to much more powerful performance guarantees than those that could be obtained
via a direct application of existing sparse and low-rank matrix splitting results.

Next, we introduce some notation used in the rest of the proof of the theorem.

Definitions related to K*: By symmetry, the SVD of K* is of the form $U \Sigma U^\top$. We define the
sub-space $T = \{UX^\top + YU^\top : X, Y \in \mathbb{R}^{n \times r}\}$ to be the span of all matrices that share either the same
column space or the same row space as K*. For any matrix $M \in \mathbb{R}^{n \times n}$, we can define its orthogonal
projection to the space T as $\mathcal{P}_T(M) = UU^\top M + MUU^\top - UU^\top MUU^\top$. We also define the projection
onto $T^\perp$, the complement orthogonal space of T, as $\mathcal{P}_{T^\perp}(M) = M - \mathcal{P}_T(M)$.
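
For a clustering matrix, U has a simple closed form: its columns are the normalized cluster indicator vectors. The sketch below builds the two projections from that observation and checks their basic properties on a random test matrix; the labelling and the function names are illustrative assumptions.

```python
import numpy as np

labels = np.array([0, 0, 0, 1, 1])
K_star = (labels[:, None] == labels[None, :]).astype(float)

# Columns of U: normalized cluster indicators, so that K* = U Sigma U^T.
U = np.stack([(labels == c) / np.sqrt((labels == c).sum())
              for c in np.unique(labels)], axis=1)
P = U @ U.T

def proj_T(M):
    """Orthogonal projection onto T."""
    return P @ M + M @ P - P @ M @ P

def proj_T_perp(M):
    """Orthogonal projection onto the orthogonal complement of T."""
    return M - proj_T(M)

M = np.random.default_rng(0).standard_normal(K_star.shape)
print(np.allclose(proj_T(K_star), K_star))                 # K* lies in T
print(np.allclose(proj_T(proj_T(M)), proj_T(M)))           # idempotent
print(abs(np.sum(proj_T(M) * proj_T_perp(M))) < 1e-10)     # the two parts are orthogonal
```

The identity $\mathcal{P}_{T^\perp}(M) = (I - UU^\top) M (I - UU^\top)$ follows by expanding the product, which is why the two parts are orthogonal in the trace inner product.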

Definitions related to B* and partial observation: Let $\Omega^* = \{(i,j) : b^*_{ij} \neq 0\}$ be the set of
matrix entries corresponding to the disagreements. Recall that $\Omega_{obs}$ is the set of observed entries. For
any matrix N and entry set $\Omega_0$, we let $\mathcal{P}_{\Omega_0}(N) \in \mathbb{R}^{n \times n}$ be the matrix obtained from N by setting all
entries not in the set to zero. We write $\Omega_0 \sim \mathrm{Ber}_0(p)$ if the entry set $\Omega_0$ does not contain the diagonal
entries, and each pair $(i,j)$ and $(j,i)$ ($i \neq j$) is contained in $\Omega_0$ with probability p, independent of all others;
$\Omega_0 \sim \mathrm{Ber}_1(p)$ is defined similarly except that $\Omega_0$ contains all the diagonal entries. Our assumption implies
$\Omega_{obs} \sim \mathrm{Ber}_1(p_0)$ and $\Omega^* \sim \mathrm{Ber}_0(\tau)$.

Norms: $\|M\|$ and $\|M\|_F$ represent the spectral and Frobenius norm of the matrix M respectively
and $\|M\|_\infty = \max_{i,j} |m_{ij}|$.



3.2.2 Step 1: Equivalent Model for Observation and Disagreements 

Notice that the probability of success is completely determined by the distribution of $(\Omega_{obs}, B^*)$. The first
step is to show that it suffices to consider an equivalent model for generating $(\Omega_{obs}, B^*)$, which results in
the same distribution but is easier to handle. This is in the same spirit as (Candes et al., 2009) (Theorem
2.2 and 2.3 therein) and (Li, 2012) (Section 4.1 therein). In particular, we consider the following procedure:



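1. Let $\Gamma \sim \mathrm{Ber}_1(p_0(1 - 2\tau))$, and $\Omega \sim \mathrm{Ber}_0\left(\frac{2p_0\tau}{1 - p_0 + 2p_0\tau}\right)$. Let $\Omega_{obs} = \Gamma \cup \Omega$.

2. Let S be a symmetric random matrix whose upper-triangular entries are independent and satisfy
$\mathbb{P}(s_{ij} = 1) = \mathbb{P}(s_{ij} = -1) = \frac{1}{2}$.

3. Define $\Omega' \subseteq \Omega$ as $\Omega' = \{(i,j) : (i,j) \in \Omega,\ s_{ij} = 1 - 2k^*_{ij}\}$. In other words, $\Omega'$ is the set of entries of S
whose signs are consistent with a disagreement matrix.

4. Define $\Omega^* = \Omega' \setminus \Gamma$, and $\tilde{\Gamma} = \Omega_{obs} \setminus \Omega^*$.

5. Let $B^* = \mathcal{P}_{\Omega^*}(S)$.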

It is easy to verify that $(\Omega_{\mathrm{obs}}, B^*)$ has the same distribution as in the original model. In particular, we have 
$\mathbb{P}((i,j) \in \Omega_{\mathrm{obs}}) = p_0$, $\mathbb{P}((i,j) \in \Omega^*, (i,j) \in \Omega_{\mathrm{obs}}) = p_0 \tau$ and $\mathbb{P}((i,j) \in \Omega^*, (i,j) \notin \Omega_{\mathrm{obs}}) = 0$, and observe 
that $B^*$ is completely determined by its support $\Omega^*$. 

The advantage of the above model is that $\Gamma$ and $\Omega$ are independent of each other, and $S$ has random 
signed entries. This facilitates the construction of the dual certificate, especially in the regime of vanishing 
$p_0$ and $(\frac{1}{2} - \tau)$ considered in this paper. We use this equivalent model in the rest of the proof. 
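
The marginal probabilities claimed for the equivalent model can be checked with a few lines of arithmetic. The sketch below is only an illustration for a single off-diagonal pair; it assumes the sampling probabilities $p_0(1-2\tau)$ for $\Gamma$ and $\frac{2p_0\tau}{1-p_0+2p_0\tau}$ for $\Omega$, and the function name `equivalent_model_marginals` is hypothetical.

```python
def equivalent_model_marginals(p0, tau):
    """Marginals of the equivalent model for one off-diagonal pair (i, j).

    Returns (P[(i,j) in Omega_obs], P[(i,j) in Omega_star]).
    """
    p_gamma = p0 * (1.0 - 2.0 * tau)                      # P[(i,j) in Gamma]
    p_omega = 2.0 * p0 * tau / (1.0 - p0 + 2.0 * p0 * tau)  # P[(i,j) in Omega]
    # Omega_obs is the union of the independent sets Gamma and Omega.
    p_obs = 1.0 - (1.0 - p_gamma) * (1.0 - p_omega)
    # Omega_star: in Omega, sign of s_ij consistent (prob. 1/2), not in Gamma.
    p_star = p_omega * 0.5 * (1.0 - p_gamma)
    return p_obs, p_star
```

For any $0 < p_0 < 1$ and $0 < \tau < 1/2$ the two returned values simplify to $p_0$ and $p_0\tau$, which matches the original model.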

3.2.3 Step 2: Sufficient Conditions for Optimality of (K*,B*) 

We state the first-order conditions that guarantee $K^*$ and $B^*$ to be the unique optimum of (1) with high 
probability. Here and henceforth, by with high probability we mean with probability at least $1 - cn^{-10}$ 
for some constant $c > 0$. The following lemma follows from Theorem 4.4 in Li (2012) and the discussion 
thereafter. 

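Lemma 1 (Probabilistic Sufficient Optimality). Suppose $\left\| \frac{1}{(1-2\tau)p_0} \mathcal{P}_T \mathcal{P}_\Gamma \mathcal{P}_T - \mathcal{P}_T \right\| \le \frac{1}{2}$. Then $K^*$ and $B^*$ 
are unique solutions to (1) with high probability provided that there exists $W \in \mathbb{R}^{n \times n}$ such that 

1. $\|\mathcal{P}_T(W + \lambda \mathcal{P}_\Omega S - UU^\top)\|_F \le \frac{\lambda}{n^2}$ 

2. $\|\mathcal{P}_{T^\perp}(W + \lambda \mathcal{P}_\Omega S)\| \le \frac{1}{4}$ 

3. $\mathcal{P}_{\Gamma^c}(W) = 0$ 

4. $\|\mathcal{P}_\Gamma(W)\|_\infty \le \frac{\lambda}{4}$ 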

Lemma 3 in the appendix guarantees that the condition $\left\| \frac{1}{(1-2\tau)p_0} \mathcal{P}_T \mathcal{P}_\Gamma \mathcal{P}_T - \mathcal{P}_T \right\| \le \frac{1}{2}$ is satisfied with 
high probability under the assumption of Theorem 2. Thus it remains to show the existence of the dual 
certificate $W$. 



3.2.4 Step 3: Dual Certificate construction 



We use a variant of the so-called Golfing Scheme (Candes et al., 2009; Gross, 2009) to construct $W$. 
Our application of the Golfing Scheme, as well as its analysis, is different from (Candes et al., 2009) and leads 
to stronger guarantees. In particular, we go beyond existing results by allowing the fraction of observed 
entries and the density gap to be vanishing. 

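By definition, $\Gamma$ obeys $\Gamma \sim \mathrm{Ber}_1(p_0(1 - 2\tau))$. Observe that $\Gamma$ may be considered to be generated by 
$\Gamma = \bigcup_{1 \le k \le k_0} \Gamma_k$, where the sets $\Gamma_k \sim \mathrm{Ber}_1(q)$ are independent; here the parameter $q$ obeys 
$p_0(1 - 2\tau) = 1 - (1 - q)^{k_0}$, and $k_0$ is chosen to be $\lceil 5 \log n \rceil$. This implies 
$q \ge p_0(1 - 2\tau)/k_0 \ge C_0 \frac{n \log n}{K_{\min}^2}$ for some constant $C_0$, with the last inequality following from the 
assumption of Theorem 2. For any random entry set $\Omega_0 \sim \mathrm{Ber}_1(p)$, define the operator $\mathcal{R}_{\Omega_0}$ by 

$$\mathcal{R}_{\Omega_0}(M) = \sum_{i=1}^{n} m_{ii}\, e_i e_i^\top + p^{-1} \sum_{1 \le i < j \le n} \delta_{ij}\, m_{ij} \left( e_i e_j^\top + e_j e_i^\top \right),$$

where $\delta_{ij} = 1$ if $(i,j) \in \Omega_0$ and $0$ otherwise, and $e_i$ is the $i$-th standard basis vector, i.e., the $n \times 1$ column vector 
with $1$ in its $i$-th entry and $0$ elsewhere. 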
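
A minimal Python sketch of the sampling operator may help fix ideas. It assumes a symmetric input matrix $M$ (for which the formula above reduces to rescaling the sampled off-diagonal entries by $1/p$ and keeping the diagonal); the function names `sample_ber1` and `apply_R` are hypothetical.

```python
import numpy as np

def sample_ber1(n, p, rng):
    """Draw an entry set ~ Ber_1(p) as a symmetric boolean mask.

    The diagonal is always included; each off-diagonal pair {(i,j),(j,i)}
    is included independently with probability p.
    """
    upper = np.triu(rng.random((n, n)) < p, k=1)
    return upper | upper.T | np.eye(n, dtype=bool)

def apply_R(M, mask, p):
    """Apply the sampling operator to a symmetric matrix M.

    Sampled off-diagonal entries are rescaled by 1/p, unsampled ones are
    set to zero, and diagonal entries are kept unchanged.
    """
    out = np.where(mask, M / p, 0.0)
    np.fill_diagonal(out, np.diag(M))
    return out
```

The $1/p$ rescaling makes the operator unbiased on symmetric matrices: averaging `apply_R(M, sample_ber1(n, p, rng), p)` over many independent masks approaches `M`.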

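We now define our dual certificate. Let $W := W_{k_0}$, where $W_k$ is defined recursively by setting 
$W_0 = 0$ and, for all $k = 1, 2, \ldots, k_0$, 

$$W_k = W_{k-1} + \mathcal{R}_{\Gamma_k} \mathcal{P}_T \left( UU^\top - \lambda \mathcal{P}_T(\mathcal{P}_\Omega(S)) - W_{k-1} \right).$$

Clearly the equality condition in Lemma 1 is satisfied. It remains to show that $W$ also satisfies the 
inequality conditions with high probability. The proof makes use of the auxiliary lemmas given in the 
appendix. For convenience of notation, we define the quantity $\Delta_k = UU^\top - \lambda \mathcal{P}_T(\mathcal{P}_\Omega(S)) - \mathcal{P}_T(W_k)$, and 
write 

$$\prod_{i=1}^{k} \left( \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Gamma_i} \mathcal{P}_T \right) = \left( \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Gamma_k} \mathcal{P}_T \right) \cdots \left( \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Gamma_1} \mathcal{P}_T \right),$$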

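where the order of multiplication is important. Observe that by construction of $W$, we have 

$$\Delta_k = \prod_{i=1}^{k} \left( \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Gamma_i} \mathcal{P}_T \right) \left( UU^\top - \lambda \mathcal{P}_T \mathcal{P}_\Omega(S) \right), \quad \forall k = 1, \ldots, k_0, \qquad (3)$$

$$W_{k_0} = \sum_{k=1}^{k_0} \mathcal{R}_{\Gamma_k} \Delta_{k-1}. \qquad (4)$$

Inequality 1: Bounding $\|\mathcal{P}_T(W + \lambda \mathcal{P}_\Omega S - UU^\top)\|_F$. 
We have the following geometric convergence thanks to (3): 

$$\begin{aligned}
\|\mathcal{P}_T(W + \lambda \mathcal{P}_\Omega S - UU^\top)\|_F &= \|\Delta_{k_0}\|_F \\
&\le \left( \prod_{k=1}^{k_0} \left\| \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Gamma_k} \mathcal{P}_T \right\| \right) \left\| UU^\top - \lambda \mathcal{P}_T \mathcal{P}_\Omega(S) \right\|_F \\
&\overset{(a)}{\le} e^{-k_0} \left( \|UU^\top\|_F + \lambda \|\mathcal{P}_T \mathcal{P}_\Omega(S)\|_F \right) \overset{(b)}{\le} n^{-5}(n + \lambda n) \\
&\le (1 + \lambda) n^{-4} \overset{(c)}{\le} \frac{\lambda}{n^2}.
\end{aligned}$$

Here, (a) uses Lemma 3 with $\epsilon_1 = e^{-1}$, (b) uses our choices of $\lambda$ and $k_0$ and the fact that 
$\|\mathcal{P}_T \mathcal{P}_\Omega(S)\|_F \le \|\mathcal{P}_\Omega(S)\|_F \le n$, and (c) uses the lower bound on $\lambda$. 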
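
The recursion defining $W_k$ and the residuals $\Delta_k$ can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: `target` stands in for $UU^\top - \lambda \mathcal{P}_T(\mathcal{P}_\Omega(S))$ and is assumed to be symmetric and to lie in $T$, and the function names `golfing_params` and `golfing` are hypothetical.

```python
import math
import numpy as np

def golfing_params(n, p0, tau):
    """k0 = ceil(5 log n) rounds; q solves p0(1 - 2 tau) = 1 - (1 - q)^k0."""
    k0 = math.ceil(5 * math.log(n))
    q = 1.0 - (1.0 - p0 * (1.0 - 2.0 * tau)) ** (1.0 / k0)
    return k0, q

def golfing(UUt, target, q, k0, rng):
    """Run W_k = W_{k-1} + R_{Gamma_k} P_T(target - W_{k-1}) from W_0 = 0.

    Assumption: `target` is symmetric and lies in T. Returns W, the list of
    residuals Delta_0, ..., Delta_{k0}, and the sampled masks Gamma_1..Gamma_{k0}.
    """
    n = UUt.shape[0]
    proj_T = lambda M: UUt @ M + M @ UUt - UUt @ M @ UUt
    W = np.zeros((n, n))
    deltas, masks = [target - proj_T(W)], []
    for _ in range(k0):
        # Gamma_k ~ Ber_1(q): diagonal always in, off-diagonal pairs w.p. q.
        upper = np.triu(rng.random((n, n)) < q, k=1)
        mask = upper | upper.T | np.eye(n, dtype=bool)
        step = proj_T(target - W)                # equals Delta_{k-1}
        R_step = np.where(mask, step / q, 0.0)   # sampling operator R_{Gamma_k}
        np.fill_diagonal(R_step, np.diag(step))
        W = W + R_step
        deltas.append(target - proj_T(W))
        masks.append(mask)
    return W, deltas, masks
```

By construction `W` vanishes outside the union of the sampled masks, which is the equality condition of Lemma 1, and consecutive residuals satisfy $\Delta_k = (\mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Gamma_k} \mathcal{P}_T)\Delta_{k-1}$ as in (3). Whether $\|\Delta_k\|_F$ actually decays depends on $q$ being large enough, as quantified by Lemma 3.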

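Inequality 4: Bounding $\|\mathcal{P}_\Gamma(W)\|_\infty$. 

We have 

$$\|\mathcal{P}_\Gamma(W)\|_\infty = \|\mathcal{P}_\Gamma(W_{k_0})\|_\infty \overset{(a)}{\le} \sum_{k=1}^{k_0} \|\mathcal{R}_{\Gamma_k} \Delta_{k-1}\|_\infty \le q^{-1} \sum_{k=1}^{k_0} \|\Delta_{k-1}\|_\infty.$$

Here, (a) uses (4). We proceed as follows: 

$$\begin{aligned}
\sum_{k=1}^{k_0} \|\Delta_{k-1}\|_\infty &\overset{(b)}{=} \sum_{k=1}^{k_0} \left\| \prod_{i=1}^{k-1} \left( \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Gamma_i} \mathcal{P}_T \right) \left( UU^\top - \lambda \mathcal{P}_T \mathcal{P}_\Omega(S) \right) \right\|_\infty \\
&\overset{(c)}{\le} \sum_{k=1}^{k_0} \left( \frac{1}{2} \right)^{k-1} \left\| UU^\top - \lambda \mathcal{P}_T \mathcal{P}_\Omega(S) \right\|_\infty \\
&\overset{(d)}{\le} c' \left( \frac{1}{K_{\min}} + \lambda \sqrt{\frac{p_0 n \log n}{K_{\min}^2}} \right) \qquad (5)
\end{aligned}$$

for some constant $c'$, where (b) uses (3), (c) uses Lemma 5 and (d) uses Lemma 6. It follows that 

$$\|\mathcal{P}_\Gamma(W)\|_\infty \le \frac{c'}{q} \left( \frac{1}{K_{\min}} + \lambda \sqrt{\frac{p_0 n \log n}{K_{\min}^2}} \right) \le \frac{c' k_0}{p_0(1 - 2\tau)} \left( \frac{1}{K_{\min}} + \lambda \sqrt{\frac{p_0 n \log n}{K_{\min}^2}} \right) \le \frac{\lambda}{4},$$

where the last inequality holds under the assumption of Theorem 2. 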
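Inequality 2: Bounding $\|\mathcal{P}_{T^\perp}(W + \lambda \mathcal{P}_\Omega(S))\|$. 
Observe that by the triangle inequality, we have 

$$\|\mathcal{P}_{T^\perp}(W + \lambda \mathcal{P}_\Omega(S))\| \le \lambda \|\mathcal{P}_{T^\perp}(\mathcal{P}_\Omega(S))\| + \|\mathcal{P}_{T^\perp}(W_{k_0})\|.$$

For the first term, a standard argument about the norm of a matrix with i.i.d. entries (e.g., see Vershynin 
(2007)) gives 

$$\lambda \|\mathcal{P}_{T^\perp}(\mathcal{P}_\Omega(S))\| \le \lambda \|\mathcal{P}_\Omega(S)\| \le \frac{1}{8}.$$

It remains to show that the second term is bounded by $\frac{1}{8}$. To this end, we observe that 

$$\begin{aligned}
\|\mathcal{P}_{T^\perp}(W_{k_0})\| &\overset{(a)}{\le} \sum_{k=1}^{k_0} \left\| \mathcal{P}_{T^\perp} \left( \mathcal{R}_{\Gamma_k} \Delta_{k-1} - \Delta_{k-1} \right) \right\| \\
&\overset{(b)}{\le} c \sqrt{\frac{n \log n}{q}} \sum_{k=1}^{k_0} \|\Delta_{k-1}\|_\infty \\
&\le c \sqrt{\frac{k_0\, n \log n}{p_0(1 - 2\tau)}} \left( \frac{1}{K_{\min}} + \lambda \sqrt{\frac{p_0 n \log n}{K_{\min}^2}} \right) \\
&\le \frac{1}{8}, \qquad (6)
\end{aligned}$$

where (a) uses (4) and the fact that $\Delta_k \in T$, and (b) uses Lemma 4. 
This completes the proof of Theorem 2. 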

4 Experimental Results 

We explore the performance of our algorithm as a function of various graph parameters $(n, K_{\min}, p_0, \tau)$ via 
simulation. We see that the performance matches well with the theory. 

In the experiment, each test case is constructed by generating a graph with $n$ nodes divided into 
clusters of equal size $K_{\min}$, and then placing a disagreement on each pair of nodes with probability $\tau$, 
independent of all others. Each node pair is observed with probability $p_0$. The optimization problem (1) 
is solved using the fast algorithm in (Lin et al., 2009) with $\lambda$ set via binary search described in Algorithm 
1. We check if the algorithm successfully outputs a solution that equals the underlying true clusters. In 
the first set of experiments, we fix $\tau = 0.2$ and $K_{\min} = n/4$ and vary $(p_0, n)$. For each $(p_0, n)$, we repeat 
the experiment 5 times and plot the probability of success in the left pane of Fig. 2. 

Figure 2: Simulation results verifying the performance of our algorithm as a function of the observation 
probability $p_0$ and the graph size $n$. The left pane shows the probability of successful recovery under 
different $p_0$ and $n$ with fixed $\tau = 0.2$ and $K_{\min} = n/4$; each point is an average over 5 trials. After proper 
rescaling of the x-axis, the curves align as shown in the right pane, indicating a good match with the 
theory. 
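
The construction of a single test case described in this section can be sketched as follows. This is a minimal illustration under the stated model (equal-size clusters, independent disagreements with probability $\tau$, independent observations with probability $p_0$, diagonal always observed); the function name `make_test_case` is hypothetical, and the solver for problem (1) is not included.

```python
import numpy as np

def make_test_case(n, k_min, p0, tau, rng):
    """Generate (K_star, A, observed) for one simulated test case.

    K_star:   0/1 cluster matrix for n / k_min clusters of size k_min.
    A:        adjacency matrix obtained by flipping each off-diagonal pair
              of K_star independently with probability tau.
    observed: symmetric boolean mask; each off-diagonal pair is observed
              with probability p0 and the diagonal is always observed.
    """
    assert n % k_min == 0, "n must be a multiple of the cluster size"
    labels = np.repeat(np.arange(n // k_min), k_min)
    K_star = (labels[:, None] == labels[None, :]).astype(int)
    flips = np.triu(rng.random((n, n)) < tau, k=1)
    flips = flips | flips.T
    A = np.where(flips, 1 - K_star, K_star)
    obs = np.triu(rng.random((n, n)) < p0, k=1)
    observed = obs | obs.T | np.eye(n, dtype=bool)
    return K_star, A, observed
```

The disagreement matrix is then `A - K_star`, whose nonzero entries equal `1 - 2 * K_star`, and only the entries selected by `observed` would be passed to the optimization.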




Figure 3: Simulation results verifying the performance of our algorithm as a function of the observation 
probability $p_0$ and the cluster size $K_{\min}$, with $n$ and $\tau$ fixed. The results indicate a good match with the 
theory. 



One observes that our algorithm has better performance with larger $p_0$ and $n$, and the success 
probability exhibits a phase transition. Theorem 2 predicts that, with $\tau$ fixed and $K_{\min} = n/4$, the transition 
occurs at $p_0 \propto \frac{n \log^2 n}{K_{\min}^2} \propto \frac{\log^2 n}{n}$; in particular, if we plot the success probability versus the control 
parameter $\frac{p_0 n}{\log^2 n}$, all curves should align with each other. Indeed, this is precisely what we see after proper 
rescaling of the x-axis (right pane of Fig. 2). This shows that Theorem 2 gives the correct scaling between $p_0$ 
and $n$ up to an extra log factor. 

In a similar fashion, we run another three sets of experiments with the following settings: (1) $n = 1000$ 
and $\tau = 0.2$ with varying $(p_0, K_{\min})$; (2) $K_{\min} = n/4$ and $p_0 = 0.2$ with varying $(\tau, n)$; (3) $n = 1000$ and 
$p_0 = 0.6$ with varying $(\tau, K_{\min})$. The results are shown in Figs. 3, 4 and 5; note that each x-axis represents 
a control parameter chosen according to the scaling predicted by Theorem 2. Again we observe that all 
the curves roughly align, indicating a good match with the theory. In particular, by comparing Figs. 2 
and 4 (or Figs. 3 and 5) one verifies the quadratic tradeoff between observations and disagreements ($p_0$ vs. 
$1 - 2\tau$) as predicted by Theorem 2. 

Figure 4: Simulation results verifying the performance of our algorithm as a function of the disagreement 
probability $\tau$ and the graph size $n$, with $p_0$ and $K_{\min} = n/4$ fixed. The results indicate a good match with 
the theory. 




Figure 5: Simulation results verifying the performance of our algorithm as a function of the disagreement 
probability $\tau$ and the cluster size $K_{\min}$, with $n$ and $p_0$ fixed. The results indicate a good match with the 
theory. 



Finally, we compare the performance of our method with spectral clustering, a popular method for 
graph clustering. For spectral clustering, we first impute the missing entries of the adjacency matrix with 
either zeros or random 1/0's. We then compute the largest $k$ principal components of the adjacency matrix, 
and run $k$-means clustering on the principal components (Von Luxburg, 2007); here we set $k$ equal to the 
correct number of clusters. The adjacency matrix is generated in a similar fashion as before using the 
following parameters: $n = 2000$, $K_{\min} = 200$, and $\tau = 0.1$. We vary the observation probability $p_0$ and 
plot the success probability (Fig. 6). It can be observed that our method outperforms spectral clustering 
with both imputation schemes; in particular, it requires fewer observations. 
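
The spectral baseline described above can be sketched with NumPy alone. This is a hedged illustration rather than the code used for Fig. 6: the $k$-means step below uses a deterministic farthest-point initialization (an assumption made here for reproducibility), "principal components" is read as the eigenvectors of the $k$ largest eigenvalues, and the names `kmeans_farthest_init` and `spectral_baseline` are hypothetical.

```python
import numpy as np

def kmeans_farthest_init(X, k, n_iter=50):
    """Lloyd's k-means on the rows of X with farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def spectral_baseline(A, observed, k, impute="zero", rng=None):
    """Impute unobserved entries, take the top-k eigenvectors, run k-means.

    impute="zero" fills missing entries with 0; any other value fills them
    with symmetric random 0/1 entries (zero diagonal).
    """
    if impute == "zero":
        fill = np.zeros(A.shape)
    else:
        if rng is None:
            rng = np.random.default_rng()
        upper = np.triu(rng.integers(0, 2, size=A.shape), k=1)
        fill = (upper + upper.T).astype(float)
    A_imp = np.where(observed, A, fill).astype(float)
    _, vecs = np.linalg.eigh(A_imp)  # eigenvalues in ascending order
    return kmeans_farthest_init(vecs[:, -k:], k)
```

On a fully observed, noise-free cluster matrix this sketch recovers the true partition; with many missing entries the imputed matrix is a poor proxy for the adjacency matrix, which is the regime examined in the comparison.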



Figure 6: Comparison of our method and spectral clustering for partially observed graphs under different 
observation probability $p_0$. For spectral clustering, two imputation schemes are considered: Spectral (Zero), 
where the missing entries are imputed with zeros, and Spectral (Rand), where they are imputed with 0/1 
random variables with symmetric probabilities. The result shows that our method recovers the underlying 
clusters with fewer observations. 



5 Conclusion 



We proposed a convex optimization formulation, based on a reduction to decomposing low-rank and sparse 
matrices, to address the problem of clustering partially observed graphs. We showed that under a wide 
range of parameters of the classical planted partition model, our method is guaranteed to find the optimal 
(disagreement-minimizing) clustering. In particular, our method succeeds under higher levels of noise 
and/or missing observations than existing methods in this setting. The effectiveness of the proposed 
method and the scaling of the theoretical results are validated by simulation studies. 

This work is motivated by graph clustering applications where obtaining similarity data is expensive 
and it is desirable to use as few observations as possible. As such, potential directions for future work 
include considering different sampling schemes such as active sampling, as well as dealing with sparse 
graphs with very few connections. 

Appendices 



A Auxiliary Lemmas 



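In this section, we provide several auxiliary lemmas required in the proof of Theorem 2. We will make use 
of the non-commutative Bernstein inequality. We use a version given in (Tropp, 2010). 

Lemma 2 (Tropp, 2010). Consider a finite sequence $\{M_i\}$ of independent, random $n_1 \times n_2$ matrices that 
satisfy the assumption $\mathbb{E} M_i = 0$ and $\|M_i\| \le D$ almost surely. Let 

$$\sigma^2 = \max \left\{ \left\| \sum_i \mathbb{E}\left[ M_i M_i^\top \right] \right\|, \left\| \sum_i \mathbb{E}\left[ M_i^\top M_i \right] \right\| \right\}.$$

Then for all $t \ge 0$ we have 

$$\mathbb{P}\left[ \left\| \sum_i M_i \right\| \ge t \right] \le (n_1 + n_2) \exp\left( -\frac{t^2}{2\sigma^2 + 2Dt/3} \right) \le \begin{cases} (n_1 + n_2) \exp\left( -\frac{3t^2}{8\sigma^2} \right), & \text{for } t \le \frac{\sigma^2}{D}, \\ (n_1 + n_2) \exp\left( -\frac{3t}{8D} \right), & \text{for } t \ge \frac{\sigma^2}{D}. \end{cases}$$

Remark 1. When $n_1 = n_2 = 1$, this becomes the standard two-sided Bernstein inequality. 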
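
The relation between the general tail bound and its two simplified cases can be evaluated numerically. The short sketch below only restates the formulas of Lemma 2 as functions; the names `bernstein_tail` and `bernstein_tail_simplified` are hypothetical.

```python
import math

def bernstein_tail(t, sigma2, D, n1, n2):
    """General bound: (n1 + n2) * exp(-t^2 / (2 sigma^2 + 2 D t / 3))."""
    return (n1 + n2) * math.exp(-t ** 2 / (2.0 * sigma2 + 2.0 * D * t / 3.0))

def bernstein_tail_simplified(t, sigma2, D, n1, n2):
    """Two-case bound: sub-Gaussian for t <= sigma^2 / D, else sub-exponential."""
    if t <= sigma2 / D:
        return (n1 + n2) * math.exp(-3.0 * t ** 2 / (8.0 * sigma2))
    return (n1 + n2) * math.exp(-3.0 * t / (8.0 * D))
```

For $t \le \sigma^2/D$ the denominator $2\sigma^2 + 2Dt/3$ is at most $8\sigma^2/3$, and for $t \ge \sigma^2/D$ it is at most $8Dt/3$, which is why the simplified value is never smaller than the general one.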

We will also make use of the following estimate, which follows from the structure of $U$: 

$$\left\| \mathcal{P}_T(e_i e_j^\top) \right\|_F^2 = \left\| UU^\top e_i \right\|_2^2 + \left\| UU^\top e_j \right\|_2^2 - \left\| UU^\top e_i \right\|_2^2 \left\| UU^\top e_j \right\|_2^2 \le \frac{2n}{K_{\min}^2}, \quad \forall\, 1 \le i, j \le n. \qquad (7)$$

The first auxiliary lemma is similar to Theorem 4.1 in Candes et al. (2009), but applies to the symmetric 
matrix case. Our proof is different from Candes et al. (2009). 
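
The estimate (7) is easy to check numerically. In the sketch below, which assumes that $UU^\top$ is the block matrix with entries $1/\text{size}$ on each cluster block, the squared norm is computed directly and compared with the closed form; it uses $\|UU^\top e_i\|_2^2 = (UU^\top)_{ii}$, which holds because $UU^\top$ is an orthogonal projector. The function names are hypothetical.

```python
import numpy as np

def pt_basis_norm_sq(UUt, i, j):
    """Squared Frobenius norm of P_T(e_i e_j^T), computed directly."""
    n = UUt.shape[0]
    E = np.zeros((n, n))
    E[i, j] = 1.0
    X = UUt @ E + E @ UUt - UUt @ E @ UUt
    return float(np.sum(X ** 2))

def pt_basis_norm_sq_formula(UUt, i, j):
    """Closed form from (7), using ||UU^T e_i||^2 = (UU^T)_{ii}."""
    a, b = UUt[i, i], UUt[j, j]
    return a + b - a * b
```

For clusters of sizes 3, 2 and 4, for example, both functions agree for every pair $(i, j)$ and stay below $2n/K_{\min}^2 = 4.5$.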



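Lemma 3. Suppose $\Omega_0$ is a set of entries obeying $\Omega_0 \sim \mathrm{Ber}_1(p)$. Consider the operator $\mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Omega_0} \mathcal{P}_T$. 
For some constant $C_0 > 0$, we have 

$$\left\| \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Omega_0} \mathcal{P}_T \right\| < \epsilon_1$$

with high probability provided that $p \ge C_0 \frac{n \log n}{\epsilon_1^2 K_{\min}^2}$ and $\epsilon_1 \le 1$. 

Proof. For each $(i,j)$, define the indicator random variable $\delta_{ij} = \mathbb{1}\{(i,j) \in \Omega_0\}$. We observe that for any 
matrix $M \in T$, 

$$\left( \mathcal{P}_T \mathcal{R}_{\Omega_0} \mathcal{P}_T - \mathcal{P}_T \right) M = \sum_{1 \le i < j \le n} \left( p^{-1} \delta_{ij} - 1 \right) \left\langle \mathcal{P}_T(e_i e_j^\top), M \right\rangle \mathcal{P}_T\left( e_i e_j^\top + e_j e_i^\top \right) =: \sum_{1 \le i < j \le n} \mathcal{S}_{ij}(M).$$

Here $\mathcal{S}_{ij} : \mathbb{R}^{n \times n} \to \mathbb{R}^{n \times n}$ is a linear self-adjoint operator with $\mathbb{E}[\mathcal{S}_{ij}] = 0$. We also have the bounds 

$$\|\mathcal{S}_{ij}\| \le p^{-1} \left\| \mathcal{P}_T(e_i e_j^\top) \right\|_F \left\| \mathcal{P}_T(e_i e_j^\top + e_j e_i^\top) \right\|_F \le p^{-1} \cdot 2 \left\| \mathcal{P}_T(e_i e_j^\top) \right\|_F^2 \le \frac{4n}{K_{\min}^2\, p}$$

and 

$$\begin{aligned}
\left\| \mathbb{E}\left[ \sum_{1 \le i < j \le n} \mathcal{S}_{ij}^2(M) \right] \right\|_F &= \left( p^{-1} - 1 \right) \left\| \sum_{1 \le i < j \le n} \left\langle \mathcal{P}_T(e_i e_j^\top), \mathcal{P}_T(e_i e_j^\top + e_j e_i^\top) \right\rangle m_{ij}\, \mathcal{P}_T\left( e_i e_j^\top + e_j e_i^\top \right) \right\|_F \\
&\le \left( p^{-1} - 1 \right) \max_{i < j}\, 2 \left\| \mathcal{P}_T(e_i e_j^\top) \right\|_F^2 \left\| \sum_{1 \le i < j \le n} m_{ij} \left( e_i e_j^\top + e_j e_i^\top \right) \right\|_F \\
&\le \left( p^{-1} - 1 \right) \frac{4n}{K_{\min}^2} \|M\|_F,
\end{aligned}$$

which means $\left\| \mathbb{E}\left[ \sum_{1 \le i < j \le n} \mathcal{S}_{ij}^2 \right] \right\| \le \frac{4n}{K_{\min}^2\, p}$; here we use the fact that $\mathcal{P}_T(e_j e_i^\top) = \left( \mathcal{P}_T(e_i e_j^\top) \right)^\top$ and $M$ is 
symmetric. An application of the Bernstein inequality (the first inequality in Lemma 2) then yields 

$$\mathbb{P}\left[ \left\| \sum_{1 \le i < j \le n} \mathcal{S}_{ij} \right\| \ge \epsilon_1 \right] \le 2 n^{2 - 2\beta}$$

provided $p \ge \frac{64 \beta\, n \log n}{3 K_{\min}^2 \epsilon_1^2}$ and $\epsilon_1 \le 1$. 

□ 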



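The next lemma is similar to Theorem 6.3 in (Candes and Recht, 2009) but applies to the symmetric 
matrix case. The proof is again different. 

Lemma 4. Suppose $\Omega_0$ is a set of entries obeying $\Omega_0 \sim \mathrm{Ber}_1(p)$, and $M$ is a fixed $n \times n$ symmetric matrix. 
Then for some constant $C_0 > 0$, we have 

$$\left\| (\mathcal{I} - \mathcal{R}_{\Omega_0}) M \right\| < \sqrt{\frac{C_0\, n \log n}{p}}\, \|M\|_\infty$$

with high probability provided that $p \ge C_0 \frac{\log n}{n}$. 

Proof. Define $\delta_{ij}$ as before. Notice that 

$$\mathcal{R}_{\Omega_0}(M) - M = \sum_{i < j} \left( p^{-1} \delta_{ij} - 1 \right) m_{ij} \left( e_i e_j^\top + e_j e_i^\top \right) =: \sum_{i < j} S_{ij}.$$

Here the symmetric matrix $S_{ij} \in \mathbb{R}^{n \times n}$ satisfies $\mathbb{E}[S_{ij}] = 0$, $\|S_{ij}\| \le 2 p^{-1} \|M\|_\infty$ and 

$$\left\| \mathbb{E}\left[ \sum_{i < j} S_{ij}^2 \right] \right\| = \left( p^{-1} - 1 \right) \left\| \mathrm{diag}\left( \sum_{j \ne 1} m_{1j}^2, \ldots, \sum_{j \ne n} m_{nj}^2 \right) \right\| \le \left( p^{-1} - 1 \right) n \|M\|_\infty^2 \le 2 p^{-1} n \|M\|_\infty^2.$$

When $p \ge \frac{16 \beta \log n}{3n}$, we apply the Bernstein inequality (the first inequality in Lemma 2) and obtain 

$$\mathbb{P}\left[ \left\| \sum_{i < j} S_{ij} \right\| \ge \sqrt{\frac{16 \beta\, n \log n}{3p}}\, \|M\|_\infty \right] \le 2n \exp\left( -\frac{3 \cdot \frac{16 \beta\, n \log n}{3p} \|M\|_\infty^2}{8 \cdot \frac{2n}{p} \|M\|_\infty^2} \right) \le 2 n^{1 - \beta}.$$

The conclusion follows by choosing a sufficiently large $\beta$. 

□ 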



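The third lemma is similar to Lemma 3.1 in (Candes et al., 2009), but extended to the symmetric 
matrix case. 

Lemma 5. Suppose $\Omega_0$ is a set of entries obeying $\Omega_0 \sim \mathrm{Ber}_1(p)$, and $M \in T$ is a fixed symmetric $n \times n$ 
matrix. Then for some constant $C_0 > 0$, we have 

$$\left\| \left( \mathcal{P}_T - \mathcal{P}_T \mathcal{R}_{\Omega_0} \mathcal{P}_T \right) M \right\|_\infty < \epsilon_3 \|M\|_\infty$$

with high probability, provided that $p \ge C_0 \frac{n \log n}{\epsilon_3^2 K_{\min}^2}$ and $\epsilon_3 \le 1$. 

Proof. Define $\delta_{ij}$ as before. Fix an entry index $(a, b)$. Notice that 

$$\left( \mathcal{P}_T \mathcal{R}_{\Omega_0} \mathcal{P}_T M - \mathcal{P}_T M \right)_{a,b} = \sum_{i < j} \left\langle \left( p^{-1} \delta_{ij} - 1 \right) m_{ij}\, \mathcal{P}_T\left( e_i e_j^\top + e_j e_i^\top \right), e_a e_b^\top \right\rangle =: \sum_{i < j} \xi_{ij},$$

where $\mathbb{E}[\xi_{ij}] = 0$. We have the bounds 

$$|\xi_{ij}| \le p^{-1} |m_{ij}| \left\| \mathcal{P}_T\left( e_i e_j^\top + e_j e_i^\top \right) \right\|_F \left\| \mathcal{P}_T(e_a e_b^\top) \right\|_F \le \frac{4n}{K_{\min}^2\, p} \|M\|_\infty$$

and 

$$\begin{aligned}
\mathbb{E}\left[ \sum_{i < j} \xi_{ij}^2 \right] &= \sum_{i < j} \mathbb{E}\left[ \left( p^{-1} \delta_{ij} - 1 \right)^2 \right] m_{ij}^2 \left\langle \mathcal{P}_T\left( e_i e_j^\top + e_j e_i^\top \right), e_a e_b^\top \right\rangle^2 \\
&\le \left( p^{-1} - 1 \right) \|M\|_\infty^2 \sum_{i < j} \left\langle e_i e_j^\top + e_j e_i^\top, \mathcal{P}_T(e_a e_b^\top) \right\rangle^2 \\
&\le 2 \left( p^{-1} - 1 \right) \|M\|_\infty^2 \left\| \mathcal{P}_T(e_a e_b^\top) \right\|_F^2 \\
&\le 2 \left( p^{-1} - 1 \right) \frac{2n}{K_{\min}^2} \|M\|_\infty^2 \\
&\le \frac{4n}{K_{\min}^2\, p} \|M\|_\infty^2.
\end{aligned}$$

When $p \ge \frac{64 \beta\, n \log n}{3 K_{\min}^2 \epsilon_3^2}$ and $\epsilon_3 \le 1$, we apply the standard Bernstein inequality (the first inequality in Lemma 2) and 
obtain 

$$\mathbb{P}\left[ \left| \left( \mathcal{P}_T \mathcal{R}_{\Omega_0} \mathcal{P}_T M - \mathcal{P}_T M \right)_{a,b} \right| \ge \epsilon_3 \|M\|_\infty \right] \le 2 \exp\left( -\frac{3 \epsilon_3^2 \|M\|_\infty^2}{8 \cdot \frac{4n}{K_{\min}^2 p} \|M\|_\infty^2} \right) \le 2 n^{-2\beta}.$$

Union bound then yields 

$$\mathbb{P}\left[ \left\| \mathcal{P}_T \mathcal{R}_{\Omega_0} \mathcal{P}_T M - \mathcal{P}_T M \right\|_\infty \ge \epsilon_3 \|M\|_\infty \right] \le 2 n^{2 - 2\beta}.$$

□ 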



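The last lemma bounds the matrix infinity norm of $\mathcal{P}_T \mathcal{P}_\Omega(S)$. 

Lemma 6. Suppose $\Omega \sim \mathrm{Ber}_0\left( \frac{2 p_0 \tau}{1 - p_0 + 2 p_0 \tau} \right)$ and $S \in \mathbb{R}^{n \times n}$ has i.i.d. symmetric $\pm 1$ entries. Under the 
assumption of Theorem 2, for some constant $C_0$, we have 

$$\left\| \mathcal{P}_T \mathcal{P}_\Omega(S) \right\|_\infty \le C_0 \sqrt{\frac{p_0\, n \log n}{K_{\min}^2}}$$

with high probability. 

Proof. By the triangle inequality, we have 

$$\left\| \mathcal{P}_T \mathcal{P}_\Omega(S) \right\|_\infty \le \left\| UU^\top \mathcal{P}_\Omega(S) \right\|_\infty + \left\| \mathcal{P}_\Omega(S) UU^\top \right\|_\infty + \left\| UU^\top \mathcal{P}_\Omega(S) UU^\top \right\|_\infty,$$

so it suffices to show that each of these three terms is bounded by $C \sqrt{\frac{p_0 n \log n}{K_{\min}^2}}$ w.h.p. for some constant 
$C$. Under our assumption on $\Omega$ and $S$, each pair of symmetric entries of $\mathcal{P}_\Omega(S)$ equals $+1$ or $-1$, each with 
probability $p := \frac{p_0 \tau}{1 - p_0 + 2 p_0 \tau}$, and $0$ otherwise; notice that $p \le \frac{p_0}{2}$ since $\tau \le \frac{1}{2}$. Let $u^{(i)}$ be the $i$th row of $UU^\top$. From 
the structure of $U$, we know 

$$\left\| u^{(i)} \right\|_\infty \le \frac{1}{K_{\min}} \quad \text{and} \quad \left\| u^{(i)} \right\|_2 \le \frac{1}{\sqrt{K_{\min}}}, \quad \forall i.$$

We now bound $\left\| UU^\top \mathcal{P}_\Omega(S) \right\|_\infty$. For simplicity, we focus on the $(1,1)$ entry of $UU^\top \mathcal{P}_\Omega(S)$ and denote 
this random variable as $X$. Observe that $X = \sum_i u^{(1)}_i \left( \mathcal{P}_\Omega(S) \right)_{i,1}$, and 

$$\mathbb{E}\left[ u^{(1)}_i \left( \mathcal{P}_\Omega(S) \right)_{i,1} \right] = 0, \qquad \left| u^{(1)}_i \left( \mathcal{P}_\Omega(S) \right)_{i,1} \right| \le \left| u^{(1)}_i \right| \le \frac{1}{K_{\min}} \ \text{a.s.}, \qquad \mathrm{Var}(X) \le \sum_i \left( u^{(1)}_i \right)^2 2p \le \frac{p_0}{K_{\min}} \le \frac{p_0 n}{K_{\min}^2}.$$

Standard Bernstein inequality thus gives 

$$\mathbb{P}\left[ |X| > C \sqrt{\frac{p_0\, n \log n}{K_{\min}^2}} \right] \le 2 \exp\left( -\frac{C^2\, \frac{p_0 n \log n}{K_{\min}^2}}{\frac{2 p_0 n}{K_{\min}^2} + \frac{2C}{3 K_{\min}} \sqrt{\frac{p_0 n \log n}{K_{\min}^2}}} \right).$$

Under the assumption of Theorem 2, the right hand side is bounded by $2 n^{-12}$. It follows from a union 
bound that $\left\| UU^\top \mathcal{P}_\Omega(S) \right\|_\infty \le C \sqrt{\frac{p_0 n \log n}{K_{\min}^2}}$ w.h.p. Clearly the same bound applies to $\left\| \mathcal{P}_\Omega(S) UU^\top \right\|_\infty$. 

Finally, let $K$ be the size of the cluster that node $j$ is in; observe that due to the structure of $UU^\top$, we 
have 

$$\left| \left( UU^\top \mathcal{P}_\Omega(S) UU^\top \right)_{i,j} \right| = \left| \sum_l \left( UU^\top \mathcal{P}_\Omega(S) \right)_{i,l} \left( UU^\top \right)_{l,j} \right| \le \frac{1}{K} \cdot K \cdot \left\| UU^\top \mathcal{P}_\Omega(S) \right\|_\infty,$$

which implies $\left\| UU^\top \mathcal{P}_\Omega(S) UU^\top \right\|_\infty \le \left\| UU^\top \mathcal{P}_\Omega(S) \right\|_\infty$. This completes the proof of the lemma. 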

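□ 

B Proof of Theorem 3 

Proof. We use a standard information theoretical argument, which improves upon the proof of Theorem 5 
in (Chaudhuri et al., 2012). For simplicity we assume $n/K$ is an integer. Let $\mathcal{F}$ be the set of all possible 
partitions of $n$ nodes into $n/K$ clusters of equal size $K$. Using Stirling's approximation, we have 

$$M = |\mathcal{F}| = \frac{1}{(n/K)!} \binom{n}{K} \binom{n-K}{K} \cdots \binom{K}{K} \ge \left( \frac{n}{3K} \right)^{n \left( 1 - \frac{1}{K} \right)}$$

when $K = \Theta(n)$. 

Suppose the clustering $Y$ is chosen uniformly at random from $\mathcal{F}$, and the graph $A$ is generated from 
$Y$ according to the planted partition model with partial observations, where we use $a_{ij} = {?}$ for unobserved 
pairs. We use $\mathbb{P}_{A|Y}$ to denote the distribution of $A$ given $Y$. Let $\hat{Y}$ be any measurable function of the 
observation $A$. A standard application of Fano's inequality (Yang and Barron, 1999) gives 

$$\sup_{Y \in \mathcal{F}} \mathbb{P}\left[ \hat{Y} \ne Y \mid Y \right] \ge 1 - \frac{M^{-2} \sum_{Y^{(1)}, Y^{(2)} \in \mathcal{F}} D\left( \mathbb{P}_{A|Y^{(1)}} \,\|\, \mathbb{P}_{A|Y^{(2)}} \right) + \log 2}{\log M}, \qquad (8)$$

where $D(\cdot \| \cdot)$ is the KL-divergence. We now upper bound this divergence. Given $Y$, the $a_{ij}$'s are 
independent of each other, so we have 

$$D\left( \mathbb{P}_{A|Y^{(1)}} \,\|\, \mathbb{P}_{A|Y^{(2)}} \right) = \sum_{(i,j)} D\left( \mathbb{P}_{a_{ij}|Y^{(1)}} \,\|\, \mathbb{P}_{a_{ij}|Y^{(2)}} \right).$$

For each pair $(i,j)$, the KL-divergence is zero if $y^{(1)}_{ij} = y^{(2)}_{ij}$, and otherwise 

$$\begin{aligned}
D\left( \mathbb{P}_{a_{ij}|Y^{(1)}} \,\|\, \mathbb{P}_{a_{ij}|Y^{(2)}} \right) &\le p_0(1 - \tau) \log \frac{p_0(1 - \tau)}{p_0 \tau} + p_0 \tau \log \frac{p_0 \tau}{p_0(1 - \tau)} + (1 - p_0) \log \frac{1 - p_0}{1 - p_0} \\
&= p_0(1 - 2\tau) \log \frac{1 - \tau}{\tau} \\
&\le p_0(1 - 2\tau) \left( \frac{1 - \tau}{\tau} - 1 \right) \\
&\le c_2\, p_0 (1 - 2\tau)^2,
\end{aligned}$$

where $c_2 > 0$ is an absolute constant and the last inequality holds when $\tau = \Theta(1)$. Let $N$ be the number 
of pairs $(i,j)$ such that $y^{(1)}_{ij} \ne y^{(2)}_{ij}$. When $K = \Theta(n)$, we have 

$$N \le \left| \left\{ (i,j) : y^{(1)}_{ij} = 1 \right\} \cup \left\{ (i,j) : y^{(2)}_{ij} = 1 \right\} \right| \le n^2.$$

It follows that $D\left( \mathbb{P}_{A|Y^{(1)}} \,\|\, \mathbb{P}_{A|Y^{(2)}} \right) \le N \cdot c_2\, p_0 (1 - 2\tau)^2 \le c_2\, n^2 p_0 (1 - 2\tau)^2$. 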
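
The per-pair divergence computation can be verified numerically. The sketch below treats each $a_{ij}$ as a three-valued variable (edge, no edge, unobserved) and compares the KL-divergence computed from the definition with the closed form $p_0(1-2\tau)\log\frac{1-\tau}{\tau}$ and with the bound $p_0(1-2\tau)^2/\tau$ that follows from $\log x \le x - 1$; the function names are hypothetical.

```python
import math

def pair_kl(p0, tau):
    """KL-divergence between the laws of a_ij when y_ij differs (1 vs. 0).

    Outcomes are (edge, no edge, unobserved).
    """
    P = [p0 * (1 - tau), p0 * tau, 1 - p0]   # pair in the same cluster
    Q = [p0 * tau, p0 * (1 - tau), 1 - p0]   # pair in different clusters
    return sum(a * math.log(a / b) for a, b in zip(P, Q) if a > 0)

def pair_kl_closed_form(p0, tau):
    return p0 * (1 - 2 * tau) * math.log((1 - tau) / tau)

def pair_kl_upper_bound(p0, tau):
    """Bound obtained from log(x) <= x - 1 with x = (1 - tau) / tau."""
    return p0 * (1 - 2 * tau) ** 2 / tau
```

When $\tau$ is bounded away from zero, the factor $1/\tau$ in the last bound is absorbed into the constant $c_2$.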

Combining pieces, for the left hand side of (8) to be less than $1/4$, we need 

$$p_0 (1 - 2\tau)^2 \ge C\, \frac{1}{n}.$$

□ 